Speeding Up q-Gram Mining on Grammar-Based Compressed Texts
Authors
Abstract
We present an efficient algorithm for calculating q-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP 𝒯 of size n that represents a string T, the algorithm computes the occurrence frequencies of all q-grams in T by reducing the problem to the weighted q-gram frequencies problem on a trie-like structure of size m = |T| − dup(q, 𝒯), where dup(q, 𝒯) is a quantity that represents the amount of redundancy that the SLP captures with respect to q-grams. The reduced problem can be solved in linear time. Since m = O(qn), the running time of our algorithm is O(min{|T| − dup(q, 𝒯), qn}), improving our previous O(qn) algorithm when q = Ω(|T|/n).
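As a rough illustration of the problem setting, the following Python sketch computes q-gram frequencies directly on an SLP in O(qn) time by a well-known weighting idea: each q-gram occurrence is charged to the lowest variable whose two children it straddles, and that variable's contribution is multiplied by the number of times the variable occurs in the derivation tree. This is a baseline sketch only, not the trie-based reduction described in the abstract. The rule encoding (a list whose entries are a single character or a pair of earlier indices) and the names `qgram_frequencies` and `expand` are assumptions made for this example.

```python
from collections import Counter


def qgram_frequencies(rules, q):
    """Weighted q-gram counting on an SLP (illustrative O(qn) baseline).

    rules[i] is either a single character or a pair (l, r) with l, r < i;
    the last rule derives the whole text. This encoding is hypothetical.
    """
    k = q - 1
    n = len(rules)
    pre = [""] * n  # first min(|X_i|, q-1) characters of X_i
    suf = [""] * n  # last  min(|X_i|, q-1) characters of X_i
    for i, rule in enumerate(rules):
        if isinstance(rule, str):
            pre[i] = suf[i] = rule[:k]
        else:
            l, r = rule
            pre[i] = (pre[l] + pre[r])[:k]
            s = suf[l] + suf[r]
            suf[i] = s[max(0, len(s) - k):]

    # vocc[i]: number of nodes labelled X_i in the derivation tree.
    vocc = [0] * n
    vocc[-1] = 1
    for i in range(n - 1, -1, -1):
        if not isinstance(rules[i], str):
            l, r = rules[i]
            vocc[l] += vocc[i]
            vocc[r] += vocc[i]

    freq = Counter()
    for i, rule in enumerate(rules):
        if vocc[i] == 0:
            continue  # rule is unreachable from the start symbol
        if isinstance(rule, str):
            if q == 1:
                freq[rule] += vocc[i]  # 1-grams live at the leaves
        else:
            l, r = rule
            # Every length-q window of t straddles the X_l / X_r boundary.
            t = suf[l] + pre[r]
            for j in range(len(t) - q + 1):
                freq[t[j:j + q]] += vocc[i]
    return freq


def expand(rules):
    """Decompress the SLP (only for checking small examples)."""
    out = []
    for rule in rules:
        out.append(rule if isinstance(rule, str) else out[rule[0]] + out[rule[1]])
    return out[-1]


# Example SLP deriving the text "abaababa".
rules = ["a", "b", (0, 1), (2, 0), (3, 2), (4, 3)]
freq = qgram_frequencies(rules, 2)  # ab: 3, ba: 3, aa: 1
```

Each pair rule contributes a string of length at most 2(q − 1), so the total work is O(qn) regardless of the length of the decompressed text.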
Similar resources
Algorithms and data structures for grammar - compressed strings
This thesis presents new algorithms and data structures for handling data represented as grammar-compressed strings. The compression scheme we focus on is the Straight Line Program (SLP). In the following, 𝒮 is an SLP of size n compressing a string S of size N. We consider the following problems. The q-gram profile of a compressed string. We present an algorithm for computing the q-gram profil...
Data Structures for Grammar-compressed Strings
This thesis presents new algorithms and data structures for handling data represented as grammar-compressed strings. The compression scheme we focus on is the Straight Line Program (SLP). In the following, 𝒮 is an SLP of size n compressing a string S of size N. We consider the following problems. The q-gram profile of a compressed string. We present an algorithm for computing the q-gram profil...
Computing q-Gram Non-overlapping Frequencies on SLP Compressed Texts
Length-q substrings, or q-grams, can represent important characteristics of text data, and determining the frequencies of all q-grams contained in the data is an important problem with many applications in the field of data mining and machine learning. In this paper, we consider the problem of calculating the non-overlapping frequencies of all q-grams in a text given in compressed form, namely, ...
Compact q-Gram Profiling of Compressed Strings
We consider the problem of computing the q-gram profile of a string T of size N compressed by a context-free grammar with n production rules. We present an algorithm that runs in O(N − α) expected time and uses O(n + k_{T,q}) space, where N − α ≤ qn is the exact number of characters decompressed by the algorithm and k_{T,q} ≤ N − α is the number of distinct q-grams in T. This simultaneously matches the curr...
Fast q-gram Mining on SLP Compressed Strings
We present simple and efficient algorithms for calculating q-gram frequencies on strings represented in compressed form, namely, as a straight line program (SLP). Given an SLP of size n that represents string T, we present an O(qn) time and space algorithm that computes the occurrence frequencies of all q-grams in T. Computational experiments show that our algorithm and its variation are pract...